In [1]:
%autosave 10


Autosaving every 10 seconds

Intro

  • Used by search engines such as Google and Yandex.
  • Many machine learning competition winners use it.

Basics 101

  • Training data: a set of examples (samples).
  • Each example has n features.
  • Response is real-valued (regression) or -1/+1 (classification).
  • Goal
    • Find a function that minimises error on unseen data.

Decision trees

  • Classification and Regression Trees (CART), Breiman et al., 1984.
  • Binary trees that split features on thresholds; output is real-valued (regression).
  • sklearn.tree.DecisionTreeClassifier/Regressor (see the sketch below).
  • Leaves contain constant predictions.
  • Decision trees are very interpretable.
    • They can be plotted and inspected.
  • But they have very poor predictive performance.
    • Seldom used alone.
    • Usually used in ensembles (random forests, bagging, boosting).
    • sklearn.ensemble
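
A minimal sketch of fitting a single CART regressor; the dataset and parameter values here are illustrative, not from the talk:

In [ ]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

# toy regression problem (dataset choice is only for illustration)
X_toy, y_toy = make_regression(n_samples=1000, n_features=10, random_state=0)

# a single CART: binary, axis-aligned threshold splits, constant value per leaf
tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X_toy, y_toy)
tree.predict(X_toy[:5])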

GBRTs

Advantages

  • Can work on features with different scales (heterogeneous data).
    • By contrast, problems like face detection or text classification have features on similar scales.
  • Can swap in different loss functions.
    • e.g. robust loss functions like Huber.
  • Captures non-linear feature interactions.
    • No need to encode prior knowledge in a kernel (as with SVMs).
  • Not a black box (unlike SVMs or neural networks).

Disadvantages

  • Lots of tuning.
  • Slow to train (but fast to predict).
  • Like other tree-based methods, it cannot extrapolate (see the sketch below).
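
A small illustration of the no-extrapolation point, using a toy example of my own (not from the talk): a tree trained on inputs in [0, 10] predicts the same constant for anything beyond that range.

In [ ]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# train on a linear trend over [0, 10]
X_lin = np.linspace(0, 10, 200).reshape(-1, 1)
y_lin = 2.0 * X_lin.ravel()
tree = DecisionTreeRegressor(max_depth=4).fit(X_lin, y_lin)

# beyond the training range the prediction stays at the last leaf's constant
tree.predict([[10.0], [20.0], [100.0]])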

Boosting

  • AdaBoost
    • Each member of the ensemble is an expert on the errors of its predecessor.
    • Iterative: reweight the training samples based on errors.
    • sklearn.ensemble.AdaBoostClassifier/Regressor (see the sketch below).
  • The Viola-Jones face detector (2001) used it very successfully; a seminal application.
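
A hedged sketch of AdaBoost in scikit-learn; the dataset and settings below are my own choices, not the presenter's:

In [ ]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_hastie_10_2

X_h, y_h = make_hastie_10_2(n_samples=10000)

# each new shallow tree focuses on the examples its predecessors got wrong,
# via iterative reweighting of the training samples
ada = AdaBoostClassifier(n_estimators=200)
ada.fit(X_h, y_h)
ada.score(X_h, y_h)  # training accuracy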

Gradient boosting (J. Friedman, 1999)

  • Generalises boosting to arbitrary loss functions.
  • sklearn.ensemble.GradientBoostingClassifier/Regressor
  • Written in pure Python/NumPy, easy to extend.
  • Uses very shallow trees with a custom node splitter and pre-sorting.

In [2]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_hastie_10_2

X, y = make_hastie_10_2(n_samples=10000)
est = GradientBoostingClassifier(n_estimators=200, max_depth=3)
est.fit(X, y)

pred = est.predict(X)
est.predict_proba(X)[0]  # class probabilities


Out[2]:
array([ 0.0325271,  0.9674729])

In [4]:
import matplotlib.pyplot as plt

# plot the ensemble's prediction after each boosting stage
for pred in est.staged_predict(X):
    plt.plot(X[:, 0], pred, color='r', alpha=0.1)


KeyboardInterrupt
  • As you add more trees (stages), the fit to the training data keeps improving, but overfitting creeps in.

In [ ]:
# X_test / y_test: held-back data (e.g. from a train/test split)
import numpy as np
import matplotlib.pyplot as plt

n_estimators = len(est.estimators_)
test_score = np.empty(n_estimators)
for i, pred in enumerate(est.staged_predict(X_test)):
    test_score[i] = est.loss_(y_test, pred)  # loss on held-out data at each stage
plt.plot(np.arange(n_estimators) + 1, test_score, label='Test')
plt.plot(np.arange(n_estimators) + 1, est.train_score_, label='Train')
plt.legend()

Tree structure

  • max_depth controls the degree of feature interactions.
    • e.g. for geo data you need at least 2 to capture longitude/latitude interactions.
  • Friedman suggests a max depth of 3-5; the presenter uses 3-6.
  • min_samples (e.g. min_samples_leaf) requires sufficient samples per leaf; it adds a constraint (more bias), making the model more general (see the sketch below).
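
A sketch of how these knobs appear on the estimator; the values are illustrative, not recommendations from the talk:

In [ ]:
from sklearn.ensemble import GradientBoostingRegressor

# max_depth bounds the order of feature interactions each tree can model;
# min_samples_leaf adds a constraint (more bias), making trees more general
est_tree = GradientBoostingRegressor(max_depth=4, min_samples_leaf=9)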

Shrinkage

  • Learn slowly using a small learning_rate, but this needs a higher n_estimators.
  • Takes longer to train, but lowers the test error and the gap between train and test error (see the sketch below).
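
Shrinkage in code: a small learning_rate paired with a larger n_estimators (the numbers below are only a sketch):

In [ ]:
from sklearn.ensemble import GradientBoostingClassifier

# each tree's contribution is scaled down by learning_rate ("shrinkage"),
# so more estimators are needed, but the train/test gap usually narrows
est_slow = GradientBoostingClassifier(n_estimators=1000, learning_rate=0.05, max_depth=3)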

Stochastic Gradient Boosting

  • subsample: fit each tree on a random subset of the training set.
  • max_features: consider a random subset of the features at each split.
    • The presenter recommends starting with just this one.
  • Increases accuracy (see the sketch below).
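
Both randomisations as constructor arguments; the specific fractions are illustrative, not from the talk:

In [ ]:
from sklearn.ensemble import GradientBoostingClassifier

est_stochastic = GradientBoostingClassifier(
    n_estimators=1000,
    subsample=0.5,     # each tree is fit on a random half of the training rows
    max_features=0.3)  # each split looks at a random 30% of the features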

How to tune hyperparameters (best practices)

  1. Set n_estimators as high as possible, e.g. 3000.
  2. Tune the other hyperparameters via grid search (see the sketch after this list).
    • Build a param_grid.
    • gs_cv = GridSearchCV(est, param_grid).fit(X, y)
    • gs_cv.best_params_
    • Can also parallelise with joblib (n_jobs).
  3. Set n_estimators even higher and re-tune learning_rate.
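
A runnable sketch of the recipe; the param_grid values are illustrative, and the import path assumes a recent scikit-learn where GridSearchCV lives in sklearn.model_selection (older versions used sklearn.grid_search):

In [ ]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

est = GradientBoostingClassifier(n_estimators=3000)
param_grid = {'learning_rate': [0.1, 0.05, 0.02, 0.01],
              'max_depth': [4, 6],
              'min_samples_leaf': [3, 5, 9, 17]}

# n_jobs=-1 parallelises the search across cores via joblib
gs_cv = GridSearchCV(est, param_grid, n_jobs=-1).fit(X, y)
gs_cv.best_params_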

Case study

  • GBRT can directly minimise Mean Absolute Error (MAE).
  • Some methods, like random forests, act on MAE through sum of squares as a proxy.
    • Squared error emphasises outliers, which increases MAE.
  • GBRT can capture the interaction between latitude and longitude in geo coordinates.
  • est.feature_importances_ lets you peek into the black box; plot it to see the most relevant features, which is great for the exploratory phase (see the sketch after the snippet below).
    • But it does not say how features interact with each other.
    • Use partial_dependence for partial dependence (PD) plots.
    • sklearn.ensemble.partial_dependence
    • Very convenient, and the computation is cheap.
    • Automatically detects spatial effects.

from sklearn.ensemble import partial_dependence as pd

# X_train and names are the case study's training data and feature names
features = ['foo', 'bar']
fig, axs = pd.plot_partial_dependence(est, X_train, features, feature_names=names)
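
And a sketch of plotting the feature importances themselves; the plotting choices are mine, not from the talk:

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

# relative importance of each feature, accumulated over all splits in the ensemble
importances = est.feature_importances_
order = np.argsort(importances)
plt.barh(np.arange(len(order)), importances[order])
plt.yticks(np.arange(len(order)), order)  # feature indices; use real names if available
plt.xlabel('relative importance')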

  • Very flexible, general, non-parametric
  • Solid, battle tested

Questions

  • Reference for heuristics?
    • The R package gbm is great; it is heavily referenced and a source of heuristics.
  • Why this house pricing case study?
    • Just general interest.
  • Prediction of time-series?
    • A general problem for tree-based methods.
    • Try to de-trend the data beforehand.
    • To be honest, the presenter hasn't been exposed to time-series prediction problems.
    • Maybe transform the data into the predictions of each tree, then run a linear model (!!AI ?)
  • What if you want to do classification?
    • Maybe use spline regression to transform the problem into a regression problem.